What features do impersonators select to evoke a speaker's identity? We built a voice database of
public figures (targets) and imitations produced by professional impersonators. Each impersonator produced one
imitation based on their memory of the target (caricature) and another after listening to the target audio
(replica). A set of naive participants then judged the identity and similarity of pairs of voices. Identity was better
evoked by the caricatures, whereas replicas were perceived to be closer to the targets in terms of voice similarity.
We used these data to map the relevant acoustic dimensions for each task. Our results indicate that speaker
identity is mainly associated with vocal tract features, while perception of voice similarity is related to vocal
fold parameters. We therefore show how acoustic caricatures emphasize identity features at the
cost of losing similarity, which allows an analogy to be drawn with caricatures in the visual space.

Speech contains a great deal of information that goes above and beyond its semantic content. Gender,
approximate age and affective state of the speaker can be easily and reliably extracted even from small
speech samples 1 . A speaker's identity can also be recognized by selecting robust properties from acoustically
flexible voices 2 . The non-linguistic information used for this kind of task depends on two different classes of
factors that shape the human voice: extrinsic factors, determined by culture and speaking habits, such as the speaker's
accent, and intrinsic factors, which depend on the anatomy and physiology of the vocal system 3 . Here we
concentrate on intrinsic factors, which are more difficult to imitate than extrinsic ones.

Humans are natural vocal imitators. We copy aspects of other human voices, which is an essential process for
acquiring language 4,5 , and we also incorporate sounds of nature into our lexicon in the form of onomatopoeias. This
imitation process, by which arbitrary sounds (for instance, a knock) become vocalized, is strongly constrained
by the physiology and anatomy of the vocal system 6 . Similarly, although there is flexibility and versatility
in vocal impersonation 7,8 , this process is constrained by the individual voice production system. The investigation
of impersonation, its success and failure, is an empirical way to address the problem of what determines vocal
identity.

During normal speech, voiced sounds are produced by the combined action of the vocal folds and the vocal
tract 9 . The vocal folds are a pair of elastic membranes located at the glottis that can be set into oscillatory motion
by the transfer of energy from the air expelled from the lungs. The perturbed airflow produced by these oscillations
is then injected into the vocal tract, formed by the set of cavities extending from the glottal exit to the lips,
whose shape is actively controlled by different articulators such as the tongue and jaw 10,11 . The sound wave propagates
back and forth along the tract, which acts as a waveguide for the sound. From a spectral point of view, the oscillations
of the vocal folds provide a rich sound source characterized by a fundamental frequency $f_0$ (pitch) and decaying
harmonics, and the vocal tract is defined by its resonant frequencies $F_i$ (formants).

Although voiced sounds result from the combined action of the vocal tract and vocal folds, both blocks act rather
independently during normal speech, because the folds are not appreciably affected by the re-injection of sound
from the tract; this is known as source-filter theory 9 . This has consequences for the uttered sounds: from the
spectrum of a voiced sound we can extract parameters related specifically to the dynamics of the vocal folds or to the
anatomy of the vocal tract.
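
The separation described by source-filter theory can be illustrated with a minimal Python sketch. It is an illustration only and not part of the analysis in this study: the spectral roll-off of the source (12 dB per octave), the two-pole resonator model and the formant centres and bandwidths are all assumed values chosen for the example.

```python
import math


def source_amplitude(k, rolloff_db_per_octave=12.0):
    """Amplitude of the k-th harmonic of a source whose spectrum decays with frequency.

    The 12 dB/octave roll-off is an assumed, illustrative value.
    """
    return 10 ** (-rolloff_db_per_octave * math.log2(k) / 20.0)


def resonator_gain(f, centre, bandwidth):
    """Magnitude response of one two-pole resonance (a formant), with gain 1 at f = 0."""
    return centre ** 2 / math.sqrt((centre ** 2 - f ** 2) ** 2 + (bandwidth * f) ** 2)


def tract_gain(f, formants):
    """Filter gain of the vocal tract: product of the individual formant resonances."""
    gain = 1.0
    for centre, bandwidth in formants:
        gain *= resonator_gain(f, centre, bandwidth)
    return gain


def voiced_spectrum(f0, formants, n_harmonics=40):
    """Harmonic spectrum as (frequency, amplitude) pairs: source amplitude times filter gain."""
    return [(k * f0, source_amplitude(k) * tract_gain(k * f0, formants))
            for k in range(1, n_harmonics + 1)]


# Hypothetical formant centres and bandwidths in Hz.
formants = [(500.0, 80.0), (1500.0, 100.0), (2500.0, 120.0)]
low = voiced_spectrum(100.0, formants)   # harmonics at 100, 200, 300, ... Hz
high = voiced_spectrum(125.0, formants)  # same filter, different source
```

Changing `f0` moves the harmonics but leaves `tract_gain` untouched, which is the sense in which parameters of the folds and of the tract can be read separately from the spectrum.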

Different vocal anatomies produce different voices. For instance, typical female and male vocal folds differ in
size, producing female voices with higher pitch and formants than male voices. However, although we are good at
recognizing speakers' identities, we can be deceived by vocal impersonators.
What, then, are the vocal features that they select to recreate
the identity of a speaker? In this work we address this question by
identifying acoustic parameters relevant to evaluative tasks of voice
and speaker perception.

Results 

We investigate whether voices and identities can be represented in
low-dimensional spaces, identifying the acoustical parameters that
are perceptually important across subjects. Building on previous
efforts that analyzed the cases of one impersonator imitating different
targets 8 and different impersonators imitating a single target 12 , we
created a database of 3 different sentences, each one pronounced by a
different public figure (targets $T_1$ to $T_3$), along with the corresponding
imitations produced by 5 professional impersonators ($I_1$ to $I_5$). Using
written versions of the sentences, the impersonators first recorded
them with their own normal voice (n). Then, again from the written
sentences, the impersonators used their memory to imitate the
corresponding targets. These imitations rely on internal voiceprints or
caricatures (c) of the public figures' voices that are used by the
impersonators to build their imitations. Finally, the impersonators
replicated the sentences (r) just after listening to the target audio files.

Henceforth, we refer to the caricature and replica produced by
impersonator a of the target b as $I_a c_b$ and $I_a r_b$, respectively.
Similarly, we refer to the natural voice of impersonator
a pronouncing the sentence of target b as $I_a n_b$. The complete voice
database consists of 48 audio files: targets $T_j$, impersonators' natural
voices $I_i n_j$, caricatures $I_i c_j$ and replicas $I_i r_j$ for each public figure j
($1 \le j \le 3$) and impersonator i ($1 \le i \le 5$) (see Methods: voice database
for details). The targets $T_1$ and $T_2$, along with their replicas and
caricatures, can be found as Supplementary Audio files S1 to S22.

All the results reported here belong to three classes: 1) psychophysical
measures of voice similarity and speaker identity (to determine
whether an impersonation is successful or not), 2) auditory
properties of speech, and 3) the relation between auditory properties and
psychophysical measures of similarity, used to identify the auditory signature
of vocal identity.



Experiment 1: psychophysical measures of identity. Each subject
heard a single target and all its imitations in random order, and gave a
rating indicating how likely it was that the voice they had just heard belonged to
the public figure in question. They used a scale ranging from 1 (I am
sure the voice does not belong to the public figure) to 5 (I am sure the
voice belongs to the public figure) (see Methods: experiment 1 for
details).

The 3 targets showed high average ratings of belongingness ($T_1$ =
3.9 ± 0.4, $T_2$ = 3.5 ± 0.5 and $T_3$ = 4.8 ± 0.3), which shows that the
voices of the public figures were easily recognizable for the population
used in this study. Next we investigated the effect of three
independent factors, 1) impersonator ($I_1$ to $I_5$), 2) type of
impersonation (caricature or replica) and 3) impersonated character ($T_1$,
$T_2$ or $T_3$), by submitting the rating data to an ANOVA with these
three factors as fixed variables. The ANOVA (Table 1) revealed that
the three factors had significant effects. We followed up these main
effects with post-hoc t-tests to identify how these factors affected
the rating.

First, by pooling together all impersonators and impersonation
types, we investigated whether some targets were easier to imitate.
Results showed average imitation ratings of 2.58 ± 0.14, 2.81 ± 0.16
and 1.72 ± 0.12 for $T_1$, $T_2$ and $T_3$ respectively. The distribution of
ratings can be found in the upper panels of Figure 1. Comparisons of
these distributions, Bonferroni corrected for multiple comparisons,
showed a significant difference in ratings for $T_3$ compared to the other
two targets (both comparisons $p_{corr}$ < 0.001). $T_3$ is Diego Maradona,
a world-wide public figure for around 25 years. In fact, 62% of the
impersonations of $T_3$ were rated 1 ("I am sure the voice does not belong
to the public figure"), which shows that psychophysical thresholds of
acceptance of vocal identity depend, as could be expected, on the
degree of knowledge of the impersonated target. Given that $T_1$ and $T_2$
had similar and broad distributions of ratings (and also similar
periods of public activity, around 5 years) and that the $T_3$ distribution was
different and saturated towards a strong recognition of dissimilarity
with the target, we restrict our subsequent analyses mainly to the
targets $T_1$ and $T_2$ and their imitations.

Next, we submitted the data to independent ANOVAs for each
target, with impersonator and imitation type as independent factors









Figure 1 | Experiment 1: at the behavioural level, speaker identity is better elicited by caricatures (blue) than replicas (green). Each participant listened
to the set of audio files containing a single target and its imitations (caricatures and replicas) and associated them with the identity of the corresponding
public figure using a scale from 1 (the voice does not belong to the public figure) to 5 (the voice definitely belongs to the public figure). In the upper panels
we show the distributions of ratings for the 3 targets $T_1$, $T_2$ and $T_3$ across imitation types. In the lower panels, we show the ratings (mean ± sd)
for the targets (red), replicas (green) and caricatures (blue) for $T_1$ and $T_2$.
(Table 1). The ANOVAs revealed for both targets an effect of impersonator
(which merely reveals that certain impersonators produce higher
quality imitations and caricatures) and, more interestingly, a significant
effect of impersonation type. A follow-up of this ANOVA
revealed that, for both targets, the effect of type was accounted for by
higher ratings for caricatures than for replicas, with average
imitation ratings of 2.93 ± 0.09 and 2.03 ± 0.07 respectively ($T_1$: t =
8.84, df = 179, p < $10^{-4}$; $T_2$: t = 3.51, df = 84, p < $7.2 \times 10^{-4}$).
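
The text reports post-hoc t-tests without stating which variant was used. As a hedged illustration only, the Python sketch below computes a paired t statistic, under the assumption that each listener's caricature rating is compared with the same listener's replica rating for the same impersonator. The ratings are invented for the example and the function name `paired_t` is hypothetical.

```python
import math


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for two equally long rating lists."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the pairwise differences.
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n), n - 1


# Invented ratings on the 1-5 identity scale (not data from the study).
caricature_ratings = [3, 4, 2, 5, 3]
replica_ratings = [2, 2, 1, 3, 3]
t_value, dof = paired_t(caricature_ratings, replica_ratings)
```

A positive `t_value` means the first condition (here, caricatures) received higher ratings on average; the p-value would then be read from a t distribution with `dof` degrees of freedom.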

These results indicate that caricatures (produced to recreate individual
traits of the speaker) result in voices which are more typically
accepted as belonging to the targets than replicas (attempts to produce
faithful copies of the target). In other words, vocal caricatures
serve the purpose of speaker recognition better than auditory
replicas.

There are two possible interpretations of these results. A parsimonious
interpretation is that impersonators are simply not trained to
copy specific utterances, and their replicas are simply bad copies of
the targets. Another, more interesting possibility is that they are
efficiently reproducing features of the speakers' voices that are different from
the ones that code identity. In order to investigate this last hypothesis,
we designed an experiment to confirm that the replicas are
indeed good copies of the voices but fail to focus on the relevant
auditory dimensions encoding speaker identity.

Experiment 2: psychophysical measures of voice similarity. For
each target, we selected the 3 impersonators that produced the
highest ranked caricatures and lowest ranked replicas (marked in
bold type in Fig. 1). The rationale behind this choice was that we
wanted to test the hypothesis that replicas are efficient imitations of
the target voice even when they may not focus on the most salient
dimensions for identity.

Participants listened to all pairs of audio files and rated their
similarity using a scale from 1 (the two voices are very different) to 5
(the two voices are the same) (see Methods: Experiment 2). For each
target, the perceptual data produced by each subject can be organized
in a similarity matrix M in which the element $M_{ij}$ corresponds to the
similarity rating between the pair of audio files i and j. We selected
the pairs containing the target files and submitted these data to
ANOVAs with impersonator and imitation type as independent
factors (Figure 2 and Table 2). The analyses revealed an effect of
imitation type. A follow-up of this analysis showed that, for both
targets, this effect was accounted for by higher ratings for replicas
than for caricatures ($T_1$: t = 6.99, df = 51, p < $10^{-4}$; $T_2$: t =
3.13, df = 47, p = 0.03), with average imitation ratings of 3.31 ± 0.15
and 2.52 ± 0.20 for replicas and caricatures of $T_1$, and 3.19 ± 0.20
and 1.71 ± 0.12 for replicas and caricatures of $T_2$. This indicates that
replicas are quality copies of the targets' voices and are indeed perceived
as more faithful reproductions of the original voice than caricatures,
although they fail to elicit the identity of the speaker.
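
The organization of one subject's ratings into a similarity matrix, and the selection of the pairs that contain the target, can be sketched as follows. This is a minimal Python illustration with invented ratings; the file labels (`"T"`, `"I1c"`, `"I1r"`, ...) and both function names are hypothetical conventions introduced for the example.

```python
from collections import defaultdict


def build_matrix(labels, pair_ratings):
    """Arrange one subject's pair ratings as a symmetric similarity matrix M."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[None] * len(labels) for _ in labels]
    for (a, b), rating in pair_ratings.items():
        matrix[index[a]][index[b]] = rating
        matrix[index[b]][index[a]] = rating
    return matrix


def mean_rating_with_target(matrix, labels, target="T"):
    """Mean rating of the pairs that contain the target, per imitation type."""
    t = labels.index(target)
    grouped = defaultdict(list)
    for j, label in enumerate(labels):
        if j != t and matrix[t][j] is not None:
            grouped[label[-1]].append(matrix[t][j])  # last letter: c, r or n
    return {kind: sum(values) / len(values) for kind, values in grouped.items()}


# Invented ratings on the 1-5 similarity scale (not data from the study).
labels = ["T", "I1c", "I1r", "I4c", "I4r"]
ratings = {("T", "I1c"): 2, ("T", "I1r"): 3, ("T", "I4c"): 3,
           ("T", "I4r"): 4, ("I1c", "I1r"): 2}
matrix = build_matrix(labels, ratings)
means = mean_rating_with_target(matrix, labels)
```

In this toy example the replicas (`"r"`) obtain a higher mean rating with the target than the caricatures (`"c"`), which is the direction of the effect reported above.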

Acoustic spaces of similarity and identity. Our next aim was to find
the acoustic features that govern the perception of similarity and
identity. For each file in the original database, we calculated the
12-dimensional acoustical vector V = {jitter, shimmer, $f_0$, $F_i$ ($1 \le i \le 5$),
disp($F_5 - F_1$), disp($F_4 - F_3$), disp($F_5 - F_3$), disp($F_5 - F_4$)},
using mean values calculated over the length of the sentence (see
Methods: acoustic space for details). The pitch $f_0$ and the formants
$F_i$ provide the most direct information about the anatomy of a vocal
system: the first is related to the mass and elasticity of the folds,
while the formants reflect the vocal tract shape. We included two
additional vocal fold parameters, jitter and shimmer, which measure
the cycle-to-cycle variations of frequency and amplitude, respectively.
These two parameters have historically been used for qualitative
descriptions of voice pathologies and, more recently, have been
shown to be strongly associated with voice perception by naive
listeners 13,14 . This is also the case for the formant dispersions
disp($F_j - F_i$) 13-16 , calculated as the mean interval between formant
frequencies, which we included as well (see Methods: acoustic space for
details). Hence, each audio file is mapped to a 12-dimensional vector,
and we can then measure how subjective ratings of speaker identity
and voice similarity co-vary with its different dimensions.
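
A minimal Python sketch of how such a 12-dimensional vector could be assembled is given below. It is not the Praat procedure used in this study: it assumes that glottal periods, peak amplitudes and mean formant values have already been measured, uses the common "local" definition of jitter and shimmer (mean absolute cycle-to-cycle difference divided by the mean), and takes disp($F_j - F_i$) as $(F_j - F_i)/(j - i)$. All function names and input values are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive cycles, divided by the mean value."""
    diffs = [abs(a - b) for a, b in zip(values[1:], values[:-1])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def formant_dispersion(formants, j, i):
    """Mean interval between adjacent formants from F_i up to F_j (1-based, j > i)."""
    return (formants[j - 1] - formants[i - 1]) / (j - i)


def acoustic_vector(periods, amplitudes, formants):
    """Return [jitter, shimmer, f0, F1..F5, disp(5,1), disp(4,3), disp(5,3), disp(5,4)]."""
    jitter = relative_perturbation(periods)        # cycle-to-cycle period variation
    shimmer = relative_perturbation(amplitudes)    # cycle-to-cycle amplitude variation
    f0 = len(periods) / sum(periods)               # assumed: reciprocal of the mean period
    dispersions = [formant_dispersion(formants, j, i)
                   for j, i in ((5, 1), (4, 3), (5, 3), (5, 4))]
    return [jitter, shimmer, f0] + list(formants) + dispersions


# Invented measurements: periods in seconds, amplitudes in arbitrary units, formants in Hz.
periods = [0.010, 0.011, 0.010, 0.011]
amplitudes = [1.0, 0.9, 1.0, 0.9]
formants = [500.0, 1500.0, 2500.0, 3500.0, 4500.0]
vector = acoustic_vector(periods, amplitudes, formants)
```

With this definition, the dispersion between adjacent formants, such as disp($F_5 - F_4$), reduces to their plain difference.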

In experiment 1 we had only one scalar (the subjective rating)
associating each voice with the target. The results showed broad variability,
with some voices being systematically associated with the target and others never
confounded with it (respectively high and low subjective
ratings in Figure 1). Within this variability, the analysis clearly showed
that caricatures were closer to the target than replicas. Within the
caricatures, two impersonators showed particularly efficient imitations
of the target. In fact, the highest ranked caricatures (marked with
* and + in Fig. 1) were indistinguishable from the targets in
subjective ranking of belongingness for both $T_1$ and $T_2$ (P < 0.05,
Friedman test).

To reveal the relevant acoustical variables for speaker recognition,
our experimental approach was to identify the acoustic variables
in which the efficient caricatures were proximal to their corresponding
targets. To this aim we performed the following analysis: first, for
each target, we mapped the 11 audio files (the target along with its 5
replicas and 5 caricatures) to all possible 2-dimensional
planes of the original 12-dimensional acoustical space (a total
of 66 spaces). For each plane, we measured classification accuracy as
the percentage of imitations that were located farther from the target
than the two highest subjectively ranked caricatures (using instead the t-value
of the test that the highest ranked caricatures were closer to the target than the other
imitations yielded identical results).
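
The plane-by-plane analysis can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used in the study: each dimension is standardised across files before distances are computed (the text does not say whether or how the dimensions were scaled), an imitation counts as correctly classified when it lies farther from the target than both best caricatures, and the function names and toy data are hypothetical.

```python
import itertools
import math


def zscore(values):
    """Standardise one acoustic dimension across files (an assumed preprocessing step)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


def plane_accuracies(features, names, target, best):
    """Percentage of remaining imitations farther from the target than the best caricatures.

    features: dict mapping file label -> list of acoustic values (same order as names).
    best: labels of the highest ranked caricatures.
    """
    files = list(features)
    columns = [zscore([features[f][d] for f in files]) for d in range(len(names))]
    z = {f: [columns[d][k] for d in range(len(names))] for k, f in enumerate(files)}
    others = [f for f in files if f != target and f not in best]
    result = {}
    for a, b in itertools.combinations(range(len(names)), 2):
        def dist(f):
            return math.hypot(z[f][a] - z[target][a], z[f][b] - z[target][b])
        threshold = max(dist(f) for f in best)
        farther = sum(1 for f in others if dist(f) > threshold)
        result[(names[a], names[b])] = 100.0 * farther / len(others)
    return result


# Toy data with 3 dimensions (hypothetical values, not measurements from the study).
toy = {"T": [0.0, 0.0, 0.0], "C1": [1.0, 1.0, 9.0],
       "R1": [5.0, 5.0, 0.0], "R2": [-5.0, -5.0, 1.0]}
accuracy = plane_accuracies(toy, ["d0", "d1", "d2"], target="T", best=["C1"])
n_planes = len(list(itertools.combinations(range(12), 2)))
```

For 12 dimensions, `itertools.combinations` enumerates the 66 planes mentioned above. In the toy data the caricature `C1` is nearest to the target in the plane (d0, d1) but farthest in the plane (d0, d2), which shows how the choice of plane changes the classification.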

The results are shown in the upper panels of Figure 3. The dimension
disp($F_5 - F_4$) is a strong marker of identity, as it shows a good
performance for both targets: for $T_1$, the spaces that result from the
combination of this dimension with shimmer, jitter, $F_5$ or
disp($F_5 - F_1$) are such that the best caricatures are more similar to the target
than the rest of the imitations. For $T_2$, the same holds for the
combination of disp($F_5 - F_4$) with $F_5$, $F_2$ or disp($F_5 - F_3$). We summarize
these results in the histogram of Figure 3, where we show the number
of spaces as a function of the correctly classified imitations for both
targets. Only 3 acoustical spaces perform significantly (>2 sd) at
Figure 2 | Experiment 2: at the behavioural level, voice similarity is better
elicited by replicas (green) than caricatures (blue). Each participant
listened to all pairs from a set of M = 10 audio files (M(M + 1)/2 = 55
audio pairs) composed of a target T and the caricatures, replicas and
normal voices of the 3 impersonators with the highest ranked caricatures of
experiment 1 (bold face in Fig. 1). Participants were asked to rate the voice
similarity of each pair using a scale from 1 (the two voices are very
different) to 5 (the two voices are the same). For both targets, replicas
(green) display higher ratings than caricatures (blue) for voice similarity,
opposite to the results of experiment 1 for speaker identity shown in
Figure 1.
Figure 3 | Experiment 1: at the acoustical level, speaker identity is strongly related to vocal tract features. Upper panels: for every 2-dimensional
acoustical space, we measured the classification accuracy as the percentage of imitations that are located farther from the target than the highest ranked
caricatures (marked with + and * in Fig. 1) (left). We summarize these results for both targets $T_1$ and $T_2$ in a histogram showing the number of acoustical
spaces as a function of their classifying performances (right). Lower panels: organization of the targets and imitations in the 2-dimensional space ($F_5$,
disp($F_5 - F_4$)), where the highest ranked caricatures of experiment 1 are closer to the corresponding target than the rest of the files for both $T_1$ and $T_2$.



locating the best caricatures closer to the targets than the rest of the
imitations: (shimmer, disp($F_5 - F_4$)), (jitter, disp($F_5 - F_4$)) and ($F_5$,
disp($F_5 - F_4$)), and only the last one displays a perfect performance.
The organization of files for this case is shown in the lower panels of
Figure 3.

In experiment 2 the procedure to map auditory onto psychophysical
dimensions is easier because the experiment produced a
similarity matrix of dimensionality comparable to the auditory space.
Hence, we performed a multidimensional scaling analysis 13,14,17 (see
Methods) that maximizes the fit of the dissimilarity matrices
to a Euclidean space where audio files are organized according
to the experimental perceptual distances (upper panels of Figure 4).
Note that even if the data are collapsed to such a low-dimensional
space, this representation allows visualizing the effect of imitation
type, with replicas (green) closer to the target (red) than caricatures
(blue), and also the rough organization of the two imitation types in
two clusters. We also show the natural voices of the impersonators
(pink). From their different distributions with respect to the target
voices (far from $T_1$ and relatively close to $T_2$), we conclude that the
relation we observed, of replicas being perceived closer to the target
than caricatures, is not directly related to the proximity between the
natural voices of the impersonators and the targets.

We then submitted these perceptual embeddings to a redundancy
analysis in order to explain as much variance as possible using the
acoustical parameters (see Methods). In the lower panels of Figure 4
we show the biplots for $T_1$ and $T_2$. Although there are contributions
from a variety of acoustic parameters, in both cases the most salient
ones are $f_0$ and jitter, which correlate with the two perceptual dimensions (dim1 and dim2).
In the case of $T_1$, the correlations are r = 0.88 for $f_0$ and r = 0.78 for jitter.
In the case of $T_2$, they are r = 0.91 for $f_0$
and r = -0.89 for jitter.

These acoustical analyses allow a rough correspondence to be drawn
between the different types of imitation and the main building blocks
of the vocal system: for the construction of replicas, impersonators
focus mainly on vocal fold properties, producing quality copies of
the original voice. Caricatures, on the other hand, are constructed
using robust vocal tract features, and the voices produced elicit the
identity of the impersonated public figure.

Discussion 

In this work we investigated the auditory features that strongly relate 
to speaker identity and voice similarity. We capitalized on the ability 
of professional impersonators to generate voices which can simulate 
another person's identity. 

Our two most important findings are: 1) at the behavioural level,
replicas are more likely to be perceived as identical to the acoustic
target than caricatures, whereas when listeners focus on the
identity of speakers to whose voices they have long been exposed,
caricatures are more likely to be associated with the speaker than
replicas; and 2) at the acoustical level, we identified different dimensions
that are relevant for each task: the information used by listeners
to judge voice similarity is related to the vocal folds ($f_0$ and
Figure 4 | Experiment 2: at the acoustical level, voice similarity is mainly associated with vocal folds' features. Upper panels: 2-dimensional spaces
resulting from the INDSCAL analysis, summarizing the perceptual organization of files from experiment 2 on voice similarity. For each target (red), we
show the impersonators' caricatures (blue), replicas (green) and normal voices (pink). We further submitted these data to a redundancy analysis to find the
combination of acoustic parameters explaining the variance of the perceptual data. The resulting biplots are shown in the lower panels with the vectors
indicating the correlation between the main acoustic parameters and the axes of ordering space.



jitter) and speaker identity is mostly associated with the vocal tract
feature disp($F_5 - F_4$). Maximal categorization is observed when it is
combined with another vocal tract feature (the formant $F_5$), but high
classifying performance is also observed when disp($F_5 - F_4$) is combined
with the vocal fold parameters shimmer and jitter.

The higher formants ($F_4$ and $F_5$) were good candidates to index
identity because they are more stable than the lower formants ($F_1$, $F_2$
and $F_3$) along the utterances 10,15 . This can be seen as two independent
channels to communicate semantic content and identity: the lower
formants vary to produce utterances that encode linguistic information,
typically vowels indexed in the space ($F_1$, $F_2$) 9 . While these
formants vary along the sentences, higher formants remain largely
unchanged, indexing stable properties of the discourse such as
speaker identity. A prediction of this model, which can be examined
in future studies, is that identity recognition should not be impaired
for dysphonic voices, generated using turbulent noise as the sound
source, without any vocal fold activity.

In a previous study, Baumann and Belin 13 asked participants to
listen to pairs of voices and determine whether they belonged to the
same person. To make this judgment, participants relied on voice
similarity and speaker identity (which was unknown to the participants).
In very close agreement with our findings, they found that the
parameters $f_0$ and disp($F_5 - F_4$) presented dominant contributions
in nearly orthogonal dimensions of the perceptual space. Our work
can be seen as a zoom in on this study, separating identity (as
stored in memory of a known voice) and similarity (as auditory
proximity of two consecutive voices). We identify the same components
$f_0$ and disp($F_5 - F_4$) as being key auditory features with different
roles: disp($F_5 - F_4$) encodes identity and $f_0$ similarity. This
results from two combined analyses: first, identifying that
similarity and identity are different processes, by showing that two different
types of imitation (replicas and caricatures) affect these tasks differently;
second, relating variability in perceptual performance
(in similarity and identity) to variability in the auditory features of
the voices.

The poor results obtained with $T_3$ (Figure 1) suggest that the
period of exposure to a speaker's voice is critical to separate different
scenarios of speaker recognition: one in which impersonators convince
their listeners by copying specific voice features of the original
voice, and another in which a separate consideration of the acoustic
parameters is not enough to elicit the identity of the speaker 18 .

Humans use faces and voices as strong identity carriers. Although
performance is quite poor for speaker recognition compared to
face recognition 1 , some parallels can be traced between visual and
acoustic perceptual processing, as was recently suggested by the
reconstruction of visual and speech objects from neural populations
using similar models 2,19 .

An overall conclusion of our experiments is that acoustic caricatures
seem to emphasize identity features at the cost of losing similarity,
which allows an analogy to be drawn between acoustic and visual
caricatures. However, this requires a note of caution. In our work we
did not address the idea of exaggeration of vocal features, which is
the first thing that naturally comes to mind when one uses the
metaphor of caricatures. Instead, our focus was on identifying the
acoustical dimensions in which caricatures are effectively proximal
to the targets.

Methods 

A total of 128 native Spanish speakers (82 females, age 32 ± 13) with normal hearing
and no vocal training participated in 2 perceptual experiments. A total of 5 native
Spanish speakers (0 females, age 34 ± 7) participated in the construction of the voice
database. All the participants signed a written consent form.

All the experiments described in this paper were approved by the ethics committee
Comité de Ética del Centro de Educación Médica e Investigaciones Clínicas 'Norberto
Quirno' (CEMIC), qualified by the Department of Health and Human Services (HHS,
USA): IRB00001745-IORG 0001315.

Voice database. We constructed an audio database containing 3 sentences
pronounced by public figures (targets $T_1$, $T_2$ and $T_3$) and the imitations recorded by 5
professional impersonators. The public figures were selected out of the common
repertoire of the 5 professional impersonators: a TV entertainer ($T_1$), a former
Argentinian president ($T_2$) and a world-famous former soccer player ($T_3$).
Target audio files were extracted from public access audiovisual media, selecting the
best quality audio files (sampling frequency of at least 22.05 kHz and low
reverberation effects). The sentences were chosen from the targets in normal speech
situations and their content was selected to avoid inducing strong emotions or stress
during impersonation.

The impersonators produced imitations of each sentence in 3 different conditions:
first, using written versions of the sentences, they recorded them with their normal
voices (n) and impersonating the corresponding targets (caricatures c). Finally, they
recorded imitations produced right after listening to the targets (replicas r). In this
way, we recorded and stored a database of 48 audio files $\{T_j + I_i n_j + I_i c_j + I_i r_j\}$,
$1 \le j \le 3$, $1 \le i \le 5$ (4.9 ± 2.3 s mean duration).

At the time of the database construction, the professional impersonators
worked at the principal radio stations in Argentina. They were recorded in a low-noise
room at a sampling frequency of 22.05 kHz, with a Takstar SGC568 microphone,
using Praat 20 . Audio files in the database were equalized in loudness. A low-level
pink noise (power spectral density $S(f) \propto 1/f$) was added to mask low frequency
differences between the copies recorded at the laboratory and the original target files
taken from audiovisual media. The audio files of targets $T_1$ and $T_2$ and their imitations
are available online as Supplementary Information.
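
For readers unfamiliar with pink noise, the following Python sketch shows one simple way to build a signal whose power falls as 1/f. It is an assumed construction for illustration only (a sum of sinusoids at the discrete Fourier bins with amplitude proportional to $1/\sqrt{k}$ and random phases) and does not reproduce how the masking noise was generated for this database. The `level` parameter and function names are hypothetical.

```python
import math
import random


def pink_noise(n_samples, n_components, level=0.01, seed=0):
    """Sum of sinusoids at bins 1..n_components with amplitude level / sqrt(k).

    Since power is amplitude squared, the power at bin k is proportional to 1/k.
    n_components must stay below n_samples / 2 to avoid aliasing.
    """
    rng = random.Random(seed)
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    return [
        level * sum(
            math.cos(2.0 * math.pi * k * t / n_samples + phases[k - 1]) / math.sqrt(k)
            for k in range(1, n_components + 1)
        )
        for t in range(n_samples)
    ]


def bin_power(signal, k):
    """Power of the signal at discrete Fourier bin k (direct projection, no FFT)."""
    n = len(signal)
    re = sum(x * math.cos(2.0 * math.pi * k * t / n) for t, x in enumerate(signal))
    im = sum(x * math.sin(2.0 * math.pi * k * t / n) for t, x in enumerate(signal))
    return (re * re + im * im) / (n * n)


noise = pink_noise(n_samples=256, n_components=32)
ratio = bin_power(noise, 1) / bin_power(noise, 4)  # power ratio across a factor 4 in frequency
```

Bin k corresponds to the frequency k times the sampling rate divided by `n_samples`, so a fourfold increase in frequency reduces the power by a factor of four, as expected for $S(f) \propto 1/f$.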

Experiments. The perceptual experiments were written in MATLAB, using 
Psychtoolbox 21 . Mono audio files at a sampling rate of 22.05 kHz were presented to 
the participants via headphones Logitech B530 USB Headset MS Line Optimi. 

Experiment 1: identity of the speaker. The participants reported being able to recognize
the characters by their voices. To avoid saliences coming from the topics associated
with the public figures 22 , each participant completed the experiment for a single target
and its imitations, i.e. each participant listened to one single sentence uttered by the
public figure and his different impersonators. $N_1$ = 36 participants listened to the set
$\{T_1 + I_i c_1 + I_i r_1\}$, $N_2$ = 22 to the set $\{T_2 + I_i c_2 + I_i r_2\}$ and $N_3$ = 40 to $\{T_3 + I_i c_3 + I_i r_3\}$,
for a total of N = 98 participants (42 females, age 31 ± 10) with normal hearing and
no vocal training. The participants were asked to associate the audio file with the
identity of the target using the following scale: 1 (the voice doesn't belong to the public
figure), 2 (it is unlikely that the voice belongs to the public figure), 3 (the voice
probably belongs to the public figure), 4 (it is very likely that the voice belongs to the
public figure) and 5 (I am sure that the voice belongs to the public figure). The results
of the experiment are summarized in Figure 1. We explicitly excluded the participants
who did not recognize the voice of the corresponding public figure (those who graded the
target file with 1).

Experiment 2: voice similarity. We selected the files of the 3 impersonators that
produced the highest ranked caricatures and lowest ranked replicas of experiment 1.
The set $\{T_1 + (I_1 + I_4 + I_5)(c_1 + r_1 + n_1)\}$ was presented to $N_1$ = 17 participants (7
females, age 27 ± 7) and the set $\{T_2 + (I_3 + I_4 + I_5)(c_2 + r_2 + n_2)\}$ to $N_2$ = 13
participants (5 females, age 25 ± 6). Each set consisted of M = 10 files, and the
participants listened to all M(M + 1)/2 = 55 pairs in the set, including the pairs
formed by identical files, in random order. The
specific order of appearance of a given pair AB or BA was also randomized. The
participants were asked to grade the similarity of the voices of each pair using the
following scale: 1 (the two voices are very different), 2 (the two voices are different), 3
(the two voices are similar), 4 (the two voices are very similar) and 5 (the two voices
are the same). We excluded the participants whose matrices presented diagonal
elements different from 5 (they graded identical files as not being identical).
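
The trial structure and the exclusion criterion can be sketched in Python as follows. This is an illustration only: the experiments themselves were written in MATLAB with Psychtoolbox, and the function names, file labels and seed below are hypothetical.

```python
import itertools
import random


def trial_list(files, seed=0):
    """All unordered pairs, identical pairs included, in random order and orientation."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations_with_replacement(files, 2))
    # Randomise whether a pair is presented as AB or BA, then shuffle the trials.
    trials = [pair if rng.random() < 0.5 else pair[::-1] for pair in pairs]
    rng.shuffle(trials)
    return trials


def keep_participant(matrix, top=5):
    """Keep a participant only if every identical pair was rated as 'the same'."""
    return all(matrix[k][k] == top for k in range(len(matrix)))


# One target plus caricature, replica and normal voice of 3 impersonators: M = 10 files.
files = ["T"] + [f"I{i}{kind}" for i in (1, 4, 5) for kind in "crn"]
trials = trial_list(files)
```

With M = 10 files the list contains the 45 pairs of different files plus the 10 identical pairs, that is, M(M + 1)/2 = 55 trials.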

Data analysis. The audio files and perceptual data were subjected to the following 
analyses. 

Multidimensional scaling (MDS). A standard way to summarize a set of matrices
containing dissimilarity measures is to fit them as distances in some kind of perceptual
space (usually a Euclidean, low-dimensional space) through multidimensional
scaling. Several MDS models and techniques have been developed and applied
to different musical and vocal spaces. Here we used a standard weighted Euclidean
model in which the salience of each dimension is different for each subject
(INDSCAL), as provided by Praat 23 . A detailed description of the method can be
found elsewhere 13,14,24 . The perceptual spaces for targets $T_1$ and $T_2$ are shown in the
upper panels of Figure 4.
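
To make the idea of fitting dissimilarities as distances concrete, the Python sketch below implements classical (Torgerson) MDS on a single dissimilarity matrix. It is a deliberately simpler technique than the INDSCAL model used in this study, since it has no subject-specific dimension weights, and it is shown only as an illustration. Converting ratings to dissimilarities as 5 minus the mean rating is an assumption, and the function names are hypothetical.

```python
import math


def ratings_to_dissimilarity(matrices, top=5.0):
    """Average similarity matrices over subjects; assumed mapping: top - mean rating."""
    n = len(matrices[0])
    return [[top - sum(m[i][j] for m in matrices) / len(matrices)
             for j in range(n)] for i in range(n)]


def classical_mds(dissimilarity, n_dims=2, n_iter=500):
    """Classical MDS of a symmetric dissimilarity matrix (not INDSCAL)."""
    n = len(dissimilarity)
    sq = [[d * d for d in row] for row in dissimilarity]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring turns squared distances into a matrix of inner products.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * n_dims for _ in range(n)]
    for dim in range(n_dims):
        # Power iteration from a fixed, non-constant start vector.
        v = [math.sin(i + 1.0 + dim) for i in range(n)]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # The Rayleigh quotient gives the eigenvalue of the converged vector.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        if eigenvalue > 0.0:
            for i in range(n):
                coords[i][dim] = math.sqrt(eigenvalue) * v[i]
        # Deflation removes this component before extracting the next one.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Check on points with known planar geometry: the embedding should preserve distances.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
distances = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(distances)
```

The recovered coordinates are defined only up to rotation and reflection, so the meaningful quantities are the distances between points rather than the axes themselves.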

Acoustic space. Each audio file was associated with the 12-dimensional vector of
acoustical parameters V = {jitter, shimmer, $f_0$, $F_i$ ($1 \le i \le 5$), disp($F_5 - F_1$), disp($F_4 - F_3$),
disp($F_5 - F_3$), disp($F_5 - F_4$)}, using the mean values of the parameters over the
length of the sentence. The parameters were calculated using Praat 23 at recommended
default values. One important question is whether, beyond their mean values, the time
Figure 5 | For the short sentences of our database, prosodic contours do
not contribute to the separation of caricatures and replicas. We show the
prosodic contours of 3 variables: syllabic duration (upper panels), pitch
(middle panels) and sound intensity (lower panels). Targets are shown in
red, replicas in green and caricatures in blue. With the exception of pitch,
the two types of imitation do not show stereotyped patterns nor clustering
into different groups.

evolution of the acoustic parameters is relevant to speaker and voice recognition.
Although studies that focused on prosodic aspects were inconclusive 25 , some temporal
properties such as pitch $f_0(t)$, sound intensity $I(t)$ and duration $D(t)$ have been shown
to be cues for differentiating voices 26 . In Figure 5 we show these time traces for targets
and imitations of $T_1$ and $T_2$. With the exception of pitch, the prosodic contours of
caricatures (blue) and replicas (green) follow similar patterns for the short sentences
used in this work, and were excluded from the analysis. With respect to pitch, we use
the mean values to account for the differences between caricatures and replicas.

Redundancy analysis (RDA). We investigated which acoustic parameters contribute
to explaining the organization of the perceptual data in the acoustic space V. We submitted
the data of the perceptual 2-dimensional spaces calculated with INDSCAL to a
redundancy analysis using the statistical toolbox Fathom for MATLAB 27 . The fraction
of variance explained for $T_1$ is 59% and 31% for the two canonical axes (90% cumulative).
For $T_2$, the fraction explained is 61% and 38% for each canonical axis (98% cumulative).
The most salient acoustic parameters are, for both targets, $f_0$ and jitter, which
correlate with the two perceptual dimensions (dim1 and dim2). In the case of $T_1$, the
correlations are r = 0.88 for $f_0$ and r = 0.78 for jitter. In the case of $T_2$, they are
r = 0.91 for $f_0$ and r = -0.89 for jitter. A posterior Monte-Carlo test showed
significance (p = 0.022 and p = 0.028 for $T_1$ and $T_2$ respectively), which implies that at
least one of the parameters has an effect on the ordering of the audio files. The
ordination distance biplots are shown in the lower panels of Figure 4.
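
The logic of relating one acoustic parameter to one perceptual dimension, and of assessing it with a Monte-Carlo (permutation) test, can be sketched in Python as below. This is a much simpler stand-in for the redundancy analysis performed with the Fathom toolbox: it handles a single parameter and a single dimension at a time, and the data, function names and number of permutations are invented for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p(x, y, n_perm=999, seed=0):
    """Share of random reassignments of y whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return (hits + 1) / (n_perm + 1)


# Invented example: an acoustic parameter and a perceptual coordinate for 10 files.
pitch = [float(k) for k in range(1, 11)]
dim = [2.0 * value + 1.0 for value in pitch]
r = pearson(pitch, dim)
p = permutation_p(pitch, dim)
```

A small `p` indicates that the observed association would rarely arise if the acoustic values were assigned to the files at random, which is the reasoning behind the Monte-Carlo test reported above.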